SUMMARY AND CONCLUSIONS


The hypothesis tested was that high agreement among the ratings assigned the
same men by different raters does not necessarily imply predictable ratings.

Ratings by three superior officers (Officers and Chief Petty Officers) of
100 submariners serving aboard 21 different submarines were divided into four
samples so as to achieve four levels of inter-rater agreement (.00, .69, .84 and
.94). Correlations were then computed within each sample between three predictor
variables (Submarine School Class Standing and the Navy General Classification
and Mechanical Aptitude Tests) and the mean of the three ratings assigned to each
ratee.

The hypothesis was supported by the results. None of the six correlations
between the predictor variables and the ratings for which the inter-rater
agreement estimates were high (.84 and .94) was significantly different from zero.
Four of the six correlations computed for the low agreement ratings (.00 and .69)
were significantly different from zero, one at the .01 level and three at the .05
level.

It was concluded that high inter-rater agreement does not necessarily imply
predictability and may indicate a lack of it. Low agreement, on the other hand,
may in some cases indicate predictability and possibly validity.


THE  PROBLEM 


When ratings are used as criterion measures, more "ultimate" criteria of
performance are generally not available for validating them; indeed, if a more
ultimate measure were available, ratings probably would not be employed in the first
place. It is necessary, therefore, either simply to accept the ratings as valid
or to seek some indirect indication of their validity. One such indication often
employed is the reliability of the ratings as shown by the amount of agreement
among scores assigned the same ratees by different raters. Another is the
predictability of the ratings, or the extent to which they correlate with measures
to which they should be related, according to logic or the results of previous
research.

The hypothesis tested here is that inter-rater agreement and predictability
may be incompatible indications of validity. More specifically, it is hypothesized
that high inter-rater agreement is not necessarily indicative of predictability
and that disagreement among raters, on the other hand, may be associated
with predictability and possibly validity.


METHOD


A sample of 100 submariners was divided into four equal groups according to
the degree to which the ratings assigned them by three superiors were in agreement;
thus four different levels of inter-rater agreement were obtained. Correlational
analyses were then performed within each group to determine the relative
predictability of the ratings from scores on two aptitude tests and from final class
standing at the Submarine School, New London.


The Rating Scale. Assessments of the qualifications of candidates for
the Submarine School are routinely made by psychiatrists on the staff
of the U.S. Navy Medical Research Laboratory. The rating scale employed
in this study was originally designed for the purpose of gathering
ratings to determine the validity of these assessments for subsequent
performance aboard submarines.

Since the rating scale and its development have been described
fully in other reports (4, 5, 6), only a brief description will be
given here. A general trait scale was developed containing 10 traits
considered to pertain to the technical aspects of a man's job and 10
considered to pertain to the personal adjustment aspects. The format
of the scale, an example of which is given on the following page, was
designed so that the rater assigned scores to all the men he was
rating on one trait at a time. He assigned the ratings on a scale
of 25 hypothetical submariners of the same rate and pay grade as the
men being rated. Each rater assigned his ratings independently.

For  most  of  the  analyses  reported  here,  only  the  means  of  the 
ratings  assigned  each  ratee  by  each  of  his  three  raters  on  the  ten 
technical  competence  traits  were  employed. 

The Samples. 171 men aboard 21 different submarines of the Pacific
Fleet were rated by three of their superiors, either by two Officers
and a Chief Petty Officer (CPO) or one Officer and two CPOs. The
raters on each boat were selected on the basis of their professed
knowledge of the men to be rated.

In an attempt to control the length of time the raters had
known the ratees, those men who had been aboard their boats for a
period of at least ten months were selected from this total sample
for the investigation reported here. Since there were 97 such men,
an additional three men were randomly selected from the group that
had been aboard for nine months and added to the sample to make it
an even 100 ratees.

The differences between the means of the ten technical competence
trait ratings assigned each ratee by the three men rating
him and the mean of those three means were squared and added together,
yielding what was called an "agreement" score for each
ratee. That is, a ratee's agreement score indicated the extent to
which his three raters agreed in their ratings of him. The distribution
of these scores was divided at the 25th, 50th, and 75th centiles to
yield four groups of 25 ratees each. Hereafter these experimental samples
will be called the High Agreement (HA), Moderate Agreement (MA),
Moderate Disagreement (MD), and High Disagreement (HD) groups.
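
The agreement score and the four-way split described above can be sketched in a
few lines of Python. This is only an illustration under stated assumptions: the
function names and the sample ratings are hypothetical, and cutting the ranked
scores into four equal blocks is assumed to be equivalent to dividing the
distribution at the 25th, 50th, and 75th centiles.

```python
from statistics import mean


def agreement_score(rater_means):
    """Sum of squared deviations of each rater's mean rating from the
    mean of the three rater means (smaller = closer agreement)."""
    overall = mean(rater_means)
    return sum((m - overall) ** 2 for m in rater_means)


def split_into_groups(scores):
    """Rank ratees by agreement score and cut the ranking into four
    equal blocks. Assumes len(scores) is a multiple of four."""
    labels = ["HA", "MA", "MD", "HD"]
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    size = len(scores) // 4
    groups = {label: [] for label in labels}
    for position, ratee in enumerate(order):
        groups[labels[min(position // size, 3)]].append(ratee)
    return groups


# Hypothetical rater means for eight ratees (three raters each).
ratings = [
    (14.0, 14.0, 14.0), (10.0, 12.0, 14.0), (15.0, 16.0, 17.0),
    (5.0, 12.0, 19.0), (13.0, 13.0, 16.0), (9.0, 15.0, 21.0),
    (12.0, 12.0, 12.0), (11.0, 14.0, 17.0),
]
scores = [agreement_score(r) for r in ratings]
groups = split_into_groups(scores)
print(groups)
```

Note that a low score means close agreement, so the first block of ranked
ratees corresponds to the High Agreement group.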


Estimates of Inter-Rater Agreement. The means of the three ratings
assigned to each ratee were used in the correlational analyses performed
to test the hypothesis and, thus, were regarded as the ratees' "true"
scores. The deviations of the three ratings assigned a man about his
"true" score were then regarded as error. To estimate inter-rater
agreement, these deviations were squared, summed and treated as the
error variance term in the basic equation for the coefficient of
reliability (2). The variance of the mean ratings or "true" scores was used
as the total variance term in the equation. The inter-rater agreement
estimates obtained in this manner are given in Table 1.
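
A minimal sketch of this kind of estimate follows. The report does not say how
the summed squared deviations were scaled before entering the reliability
equation, so two choices below are assumptions made only for illustration: the
error term is taken as the mean squared deviation over all ratings in a sample,
and negative estimates are reported as .00. The function name is hypothetical.

```python
from statistics import mean, pvariance


def inter_rater_agreement(ratings_by_ratee):
    """Reliability-style estimate: 1 - (error variance / total variance).

    ratings_by_ratee: one tuple of three ratings per ratee.
    """
    true_scores = [mean(r) for r in ratings_by_ratee]
    squared_devs = [
        (x - t) ** 2
        for r, t in zip(ratings_by_ratee, true_scores)
        for x in r
    ]
    # Assumption: error variance = mean squared deviation about "true" scores.
    error_variance = sum(squared_devs) / len(squared_devs)
    total_variance = pvariance(true_scores)
    if total_variance == 0:
        raise ValueError("mean ratings do not vary; estimate is undefined")
    # Assumption: negative estimates are floored at zero.
    return max(0.0, 1.0 - error_variance / total_variance)


# Hypothetical samples of three ratees each.
close = [(8, 10, 12), (12, 14, 16), (16, 18, 20)]
wide = [(0, 10, 20), (2, 11, 20), (1, 12, 23)]
print(inter_rater_agreement(close), inter_rater_agreement(wide))
```

Under these assumptions the estimate depends on the spread of the mean ratings
as well as on the raters' deviations, so two samples with the same amount of
disagreement can receive different estimates.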


Table 1

ESTIMATES OF INTER-RATER AGREEMENT
FOR THE FOUR EXPERIMENTAL SAMPLES


Sample                     Estimate

High Agreement               .94
Moderate Agreement           .84
Moderate Disagreement        .69
High Disagreement            .00

Because of the way the samples were selected, the inter-rater agreement
estimates for the two agreement samples were higher than the estimates
for the disagreement samples. The important thing to note is that
statistics such as are shown in Table 1 are often presented to indicate
the relative acceptability of obtained ratings as criterion measures
and that the ratings showing the higher inter-rater agreement estimates
would probably be preferred by most researchers conducting validation
studies.

The Criteria. Three measures were used to compare the predictability
of the ratings assigned to the men in the four samples: the Navy General
Classification Test (GCT), the Navy Mechanical Aptitude Test (MECH), and
Submarine School Class Standing (SSS). Previous research has shown that
these variables are significantly related to performance aboard submarines
as measured by ratings, check lists, and job sample performance tests (7).
Submarine School Class Standing, which is based on a composite of written
achievement test scores and instructor ratings and has an estimated
reliability of .90, was found to correlate higher with scores on the
shipboard criteria than any of a variety of predictor variables studied. It
was selected from the measures available for this study, therefore, as
the variable most likely to be related to the ultimate criterion and as
probably the best indicator of the validity as well as the predictability
of the ratings.

Table 2 shows the means and standard deviations of the scores of the
men in the four samples on the three predictor variables. Only the
difference between the variances of the scores of the HA and HD samples
on the MECH was significantly different from zero. Because of the number
of tests made, one such result was expected by chance alone; therefore,
the samples were considered to have been obtained from the same
population with respect to these variables.
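
The arithmetic behind this remark can be illustrated with the standard
deviations reported in Table 2 for the MECH (11.12 for the HA sample and 6.48
for the HD sample). The sketch below only computes the variance ratio and the
number of significant results expected by chance; the report does not state how
many tests were made, so the count of 18 used here is an assumption, and the
function names are hypothetical.

```python
def variance_ratio(sd_a, sd_b):
    """Ratio of two sample variances, with the larger variance on top."""
    var_a, var_b = sd_a ** 2, sd_b ** 2
    return max(var_a, var_b) / min(var_a, var_b)


def expected_chance_findings(n_tests, alpha=0.05):
    """Number of 'significant' results expected if every null hypothesis holds."""
    return n_tests * alpha


# HA versus HD standard deviations on the MECH, taken from Table 2.
f_mech = variance_ratio(11.12, 6.48)
print(round(f_mech, 2))

# Assumed count: 3 predictors x 6 pairs of samples = 18 variance comparisons.
print(round(expected_chance_findings(18), 1))
```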


Table 2

MEANS AND STANDARD DEVIATIONS OF SCORES ON
THE PREDICTOR VARIABLES FOR THE FOUR SAMPLES


                                  SSS               GCT               MECH
Sample                   N      M       σ         M       σ         M       σ

High Agreement          25    53.80   29.35     58.84    6.77     60.56   11.12
Moderate Agreement      25    45.52   32.20     59.62*   6.70     57.06    8.67
Moderate Disagreement   25    48.64   30.81     61.56    6.90     58.16    9.75
High Disagreement       25    52.68   27.30     58.80    6.48     59.52    6.48

* N=24


Length of Time on Board. Table 3 shows that the four samples were also
homogeneous with respect to the amount of time the ratees in each group
had spent aboard the submarine on which they were rated.


Table 3

MEANS AND STANDARD DEVIATIONS OF LENGTHS
OF TIME SPENT ON BOARD FOR THE FOUR SAMPLES


                              Months
Sample                      M        σ

High Agreement            12.64     2.33
Moderate Agreement        13.16     2.19
Moderate Disagreement     12.68     2.36
High Disagreement         13.32     2.84

Description of the Ratings Assigned. Table 4 shows the means and standard
deviations of the ratings assigned the men in the four experimental
samples. None of the differences between means or variances was
significantly different from zero, although there was a tendency for the ratings
of the HD group to be slightly less variable.

Table 4

MEANS AND STANDARD DEVIATIONS OF THE
RATINGS ASSIGNED THE FOUR SAMPLES


Sample                   N      M        σ

High Agreement          25    16.00     3.69
Moderate Agreement      25    13.54     4.70
Moderate Disagreement   25    14.65     4.03
High Disagreement       25    14.97     3.28

Scores on the Navy GCT and MECH and Submarine School Class Standing were
correlated with the mean of the ratings assigned each ratee by the three
superior officers who rated him. Separate analyses were performed for each
of the four experimental groups. The score actually used for Submarine
School Class Standing was the proportion of men in his class the ratee
exceeded; thus a man who was first in his class received a score of 1.00
while a man who was last had a score of .00. Pearson product-moment
coefficients were computed.
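
The two computations named in this paragraph can be sketched as follows. The
scoring rule is an assumption chosen so that the first man in a class scores
1.00 and the last scores .00, as the text requires: rank is counted from 1
(first in class) and the proportion is taken over the other men in the class.
The function names are hypothetical.

```python
from math import sqrt


def class_standing_score(rank, class_size):
    """Proportion of classmates a man exceeded (rank 1 = first in class)."""
    if class_size < 2:
        raise ValueError("class must contain at least two men")
    return (class_size - rank) / (class_size - 1)


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical example: class standing scores against mean ratings.
standing = [class_standing_score(rank, 40) for rank in (1, 10, 20, 30, 40)]
mean_rating = [18.0, 15.5, 16.0, 12.0, 11.5]
print(round(pearson_r(standing, mean_rating), 2))
```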


RESULTS


Relative Predictability of the Ratings

The results of the correlational analyses are shown in Table 5.


Table 5

CORRELATIONS BETWEEN SCORES ON THE PREDICTOR VARIABLES AND
RATINGS OF THE MEN IN THE FOUR EXPERIMENTAL SAMPLES


                         Agreement
Sample                   estimate      SSS      GCT      MECH      N

High Agreement              .94        .05     -.23     -.07      25
Moderate Agreement          .84        .29     -.14      .02      25
Moderate Disagreement       .69        .61**    .43*     .42*     25
High Disagreement           .00        .43*     .17      .10      25

* significant at the .05 level
** significant at the .01 level
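
The significance markers in Table 5 can be checked with the usual t test for a
correlation coefficient, t = r * sqrt(N - 2) / sqrt(1 - r^2), with N - 2
degrees of freedom. The report does not say which test was used, so the
two-tailed test and the function names below are assumptions; the critical
values are the standard two-tailed values of t for 23 degrees of freedom.

```python
from math import sqrt

# Two-tailed critical values of Student's t for df = 23 (N = 25).
T_CRIT_05 = 2.069
T_CRIT_01 = 2.807


def t_for_r(r, n):
    """t statistic for testing a correlation r from n pairs against zero."""
    return r * sqrt(n - 2) / sqrt(1 - r * r)


def significance_flag(r, n=25):
    """Return '**', '*', or '' using the df = 23 critical values above."""
    if n != 25:
        raise ValueError("critical values are tabulated for N = 25 only")
    t = abs(t_for_r(r, n))
    if t >= T_CRIT_01:
        return "**"
    if t >= T_CRIT_05:
        return "*"
    return ""


# SSS column of Table 5, from the HA sample down to the HD sample.
sss_column = [0.05, 0.29, 0.61, 0.43]
flags = [significance_flag(r) for r in sss_column]
print(flags)
```

With 25 ratees per sample, a correlation must reach roughly .40 in absolute
value before it differs significantly from zero at the .05 level.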

The trend was the same for all three predictor variables. As the inter-rater
agreement estimates decreased from the HA to the MD sample, the correlations between
the predictors and the ratings increased. Even for the sample for which the
inter-rater agreement estimate was .00, the correlation between Submarine School Class
Standing and the ratings was significantly different from zero (.05 level).

Figure 1 shows more clearly that the relationship between predictability and
inter-rater agreement may be curvilinear. Predictability, as indicated by the
correlations between SSS, GCT, and MECH scores and the mean rating assigned each
man, increased as inter-rater agreement increased to about .70. At that point,
the correlations dropped very rapidly to insignificant values.

It should be pointed out that there were no data points between inter-rater
agreement estimates of .00 and .69. It appears conceivable from the slope of the
curve between the estimates of .69 and .94 that the correlations might have been
higher for inter-rater agreement estimates between .40 and .60.


Figure 1

CORRELATION BETWEEN SCORES ON SSS, GCT, AND MECH
AND MEAN TECHNICAL COMPETENCE RATING
AS A FUNCTION OF INTER-RATER AGREEMENT

[Vertical axis: Correlation between Predictor and Rating]


A similar analysis was performed using the mean of the ratings on the ten
personal adjustment traits. The results of this analysis are shown in Table 6.

The inter-rater agreement estimate for the MA group was slightly higher than
that for the HA sample. The absolute magnitude of the sum of the squared
deviations of the three ratings assigned a man about his mean rating (or "true" score)
was greater for the MA than for the HA sample. However, the variance of the mean
ratings of the MA group was also greater, which accounts for the higher inter-rater
agreement estimate.

Table 6

CORRELATIONS BETWEEN THE PREDICTOR VARIABLES AND
PERSONAL ADJUSTMENT RATINGS FOR THE FOUR SAMPLES


                                Agreement
Sample                   N      estimate      SSS      GCT      MECH

High Agreement          25         .86        .05      .02      .16
Moderate Agreement      25         .90        .25     -.27      .01
Moderate Disagreement   25         .61        .06     -.12      .21
High Disagreement       25         .12        .65**    .44*     .06

* significant at the .05 level
** significant at the .01 level

Only two of the correlations were significantly different from zero (.05 level)
and both were in the sample for which the inter-rater agreement estimate was lowest
(.12). There was no apparent tendency, comparable to that found with the technical
competence ratings, for predictability to increase as inter-rater agreement decreased
from the two agreement groups to the MD group. Figure 2, on the following page,
shows these same results in graphic form.


Figure 2

CORRELATION BETWEEN SCORES ON SSS, GCT, AND MECH
AND MEAN PERSONAL ADJUSTMENT RATING AS A FUNCTION OF INTER-RATER AGREEMENT

[Vertical axis: Correlation between Predictor and Rating]


Investigation of Possible Sources of Variance

Additional analyses were performed in an effort to locate the source of the
predictable variance in the ratings for which the inter-rater agreement estimates
were low, i.e., the MD and HD groups. Only the ratings on the ten technical
competence traits were used in these analyses.

First it was hypothesized that the more extreme rating with respect to the
mean of the three assigned a ratee was contributing more predictable variance than
the other two. The hypothesis was based on the idea that a more deviant rating
might possibly indicate better observations of ratee behavior. The procedure
employed in testing the hypothesis was as follows: the ratings assigned the 50 ratees
in the combined MD and HD samples were plotted on a large chart. Inspection showed
that all three raters disagreed in their evaluations of some men. In the case of
others, two of the raters were in substantial agreement and only the third disagreed.
The 25 ratees (half of the combined sample) for whom this latter pattern was most
pronounced were selected for study. Scores on the predictor variables were then
correlated both with the extreme rating assigned each of these 25 men and with the
mean of the other two ratings given them. The results are shown in Table 7.
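
The separation of each ratee's three ratings into one extreme rating and the
mean of the other two can be sketched as below. In the study the 25 ratees were
chosen by inspecting a chart; the numerical index used here to rank how
pronounced the "two agree, one disagrees" pattern is, along with the function
names and sample ratings, is a hypothetical stand-in for that inspection.

```python
def split_extreme(ratings):
    """Return (extreme rating, the other two ratings) for one ratee."""
    center = sum(ratings) / 3
    idx = max(range(3), key=lambda i: abs(ratings[i] - center))
    others = [r for i, r in enumerate(ratings) if i != idx]
    return ratings[idx], others


def pattern_strength(ratings):
    """Hypothetical index: distance of the extreme rating from the mean of
    the other two, minus the gap between those two."""
    extreme, others = split_extreme(ratings)
    return abs(extreme - sum(others) / 2) - abs(others[0] - others[1])


def most_pronounced_half(ratings_by_ratee):
    """Indices of the half of the ratees with the strongest pattern."""
    ranked = sorted(
        range(len(ratings_by_ratee)),
        key=lambda i: pattern_strength(ratings_by_ratee[i]),
        reverse=True,
    )
    return sorted(ranked[: len(ranked) // 2])


# Hypothetical ratings for four ratees.
sample = [(10.0, 11.0, 20.0), (5.0, 12.0, 19.0),
          (14.0, 14.5, 9.0), (12.0, 13.0, 14.0)]
selected = most_pronounced_half(sample)
print(selected)
```

The extreme ratings and the means of the remaining pairs for the selected
ratees would then each be correlated with the predictor scores.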


Table 7

CORRELATIONS BETWEEN SCORES ON THE PREDICTOR VARIABLES
AND THE ONE DISAGREE RATING AND THE MEAN OF THE TWO AGREE RATINGS


Ratings          Number of Ratees      SSS      GCT      MECH

One Disagree            25             .50*     .56**    .50*
Two Agree               25             .35      .30      .41*

* significant at the .05 level
** significant at the .01 level

The means of the ratings assigned by the two raters who were in close agreement
were less predictable than the ratings assigned by the rater who disagreed. (It
should be pointed out that the means and variances of these two samples of ratings
were not significantly different.)

The same sort of analysis was performed using the entire combined MD and HD
samples. In this case, of course, the agreement between the "two agree" raters was
not as great, and in some cases the rating of the one "disagree" rater was not much
farther removed from the mean of the three ratings than the rating given by one of
the "two agree" raters. As shown in Table 8, the two sets of ratings were almost
equally predictable from SSS. However, scores on the GCT and MECH variables
correlated significantly (.05 level) with the more deviant rating and not with the mean
of the ratings assigned by the two raters who were in closer agreement in their
evaluations.

Table 8

CORRELATIONS BETWEEN SCORES ON THE PREDICTOR VARIABLES
AND THE ONE DISAGREE AND THE MEAN OF THE TWO AGREE RATINGS
(for the MD and HD samples combined)


Ratings          Number of Ratees      SSS      GCT      MECH

One Disagree            50             .42**    .30*     .29*
Two Agree               50             .44**    .19      .25

* significant at the .05 level
** significant at the .01 level

There was a slight but statistically insignificant tendency for the raters who
assigned the more deviant ratings to be Chief Petty Officer rather than Officer
raters. Correlational analyses indicated that the ratings assigned by these
enlisted raters were also more predictable from scores on the two aptitude tests than
were the ratings assigned by officers. However, as can be seen in Table 9, the two
correlations with Submarine School Class Standing were equal.


Table 9

(for the MD and HD samples combined)


Raters           Number of Raters      SSS      GCT      MECH

CPOs                    28             .42*     .41*     .44*
Officers                22             .42*     .01     -.21

* significant at the .05 level


DISCUSSION 


The  results  indicate  that  high  inter-rater  agreement  does  not  necessarily 
imply  predictability  in  ratings.  Whether  or  not  the  results  can  be  interpreted 
to  mean  that  inter-rater  agreement  is  not  necessarily  a  good  index  of  the  validity 
of  ratings  depends  on  whether  or  not  one  is  willing  to  accept  the  assumption  that 
the  predictor  variables  employed  in  the  study  are  positively  related  to  the  ulti¬ 
mate  criterion  of  the  performance  that  was  rated.  Even  if  that  assumption  cannot 
be  made,  the  results  indicate  that  at  least  in  some  instances  inter-rater  agreement 
and  predictability  would  yield  incompatible  indications  of  the  validity  of  ratings.

Submarine School Class Standing has been shown to be more highly related to
various criteria of shipboard performance than any of a variety of predictor
variables studied (7). Thus, the assumption with respect to the relationships between
the predictor variables used in this study and the ultimate criterion of
performance aboard submarines is probably more tenable for the SSS variable. It is
interesting to note in the light of this that it was also the variable that showed the
most significant positive relationships with the ratings for which the inter-rater
agreement estimates were low.

The explanation of the results proposed here is based first on the assumption
that ratee behavior in most performance rating situations is not entirely consistent
from one time to the next with respect to particular traits, primarily because no
effort is made to control the physical and psychological environment during the
period the ratings are designed to cover. To be valid, then, ratings must reflect
these inconsistencies.

On the other hand, even if ratees behaved entirely consistently, ratings of
them would not necessarily be in agreement since raters use different criteria in
rating on the same trait (3). The second assumption, then, is that these criteria
employed by different raters are all valid and the differences in ratings reflected
by them are also valid. This second assumption implies that part of what is
ultimate in performance is satisfying the demands of various superiors by behaving
in different ways.

On this basis, high agreement among ratings could imply a poor sampling of
observations of ratee behavior by raters, a poor sampling of raters in terms of
the criteria they use to evaluate particular traits, or both; whereas disagreement
among the ratings assigned to the same men by different raters could indicate that
a more representative sample of observations and raters was obtained. This is
contrary to the usual rationale for the presence or absence of rater agreement.

It is obvious on the basis of these two assumptions that high inter-rater
agreement could also indicate validity. If a ratee knew the criteria his superiors
were going to employ in rating him, for example, he could, assuming he had
sufficient control of his behavior regardless of the environmental situation, behave
so as to satisfy them. It is assumed, however, that the majority of men are not
entirely aware of the nature of their superiors' criteria, and even if they were,
they neither have adequate control over their behavior in all situations nor are
obsequious enough to attempt to satisfy all of their superiors' demands.


In summarizing the bases of unreliability in ratings, Ghiselli and Brown say,
"The indication is that raters disagree primarily because they observe the
individuals to be rated in different situations and under different conditions, and
because they use different criteria for judging the same trait or characteristic."
(Note that the two factors which they say contribute to a lack of agreement in
ratings are assumed in the explanation given here to contribute to their validity,
as long as they are accurately reflected in the ratings.) They continue by saying,
"It follows from this evidence that reliability of ratings can be considerably
increased by having the raters observe the individuals under similar conditions,
and by providing techniques for making the ratings that will increase the
likelihood that the traits or characteristics being judged will be evaluated on the same
basis." (1)

While it is probably true that having raters observe ratees under similar
conditions would increase inter-rater agreement, it might also serve to decrease
validity for the ultimate criterion by failing to take into account all of those
on-the-job situations in which individuals perform and the interaction between ratees
and situations.

Certain members of a work group might react favorably in one situation and
unfavorably in another. Certainly with the variety of situations individuals face
from day to day regardless of their occupations, they could not be expected to
react favorably in all of them. Submariners, for example, live in a threatening
environment faced with an infinite number of unique situations. Officers and CPOs
cannot observe their men under similar conditions simply because of the physical
layout of a submarine and because of the diverse jobs individuals in the same gang
perform. To develop a method whereby the ratees could be observed under similar
conditions, even if it were possible, would probably imply the exclusion of critical
situations in which a man's behavior would have potentially the greatest significance
as far as his contribution to the effectiveness of the boat is concerned.

Increasing  inter-rater  agreement  by  having  raters  observe  ratees  under  similar 
conditions  might,  therefore,  defeat  the  more  important  purpose  of  obtaining  valid 
ratings.  The  results  reported  here  appear  to  support  this  reasoning. 

It is not being suggested that high inter-rater agreement always implies a lack
of predictability. Essentially none of the variance in the high agreement ratings
and only a portion of the variance in the low agreement ratings was predictable from
scores on the three variables used in this study. It is conceivable that both of
these sources of variation could be predicted from other types of measures.


CONCLUSIONS


Since more ultimate criteria of performance are seldom available for determining
the validity of performance ratings, some indirect indication of their validity is
often employed. One such indication is the agreement among the ratings assigned to
the same men by different raters. Another is the predictability of the ratings from
measures to which they should be related, according to logic or the results of
previous research.

It is concluded on the basis of the study reported here that these two
indications are not necessarily compatible. Inter-rater agreement may not be a good index
of predictability. Ratings for which the inter-rater agreement estimate is low may
be more predictable than ratings for which that estimate is high.

Additional  studies  should  be  performed  to  determine  whether  or  not  these  results 
represent  a  chance  occurrence.  If  they  do  not,  further  studies  would  indicate  the 
circumstances  under  which  comparable  results  may  be  obtained.